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Abstract 

Long-running applications are often subject to failures. Once failures occur, it will lead to unacceptable system overheads. 
The checkpoint technology is used to reduce the losses in the event of a failure. For the two-level checkpoint recovery 
scheme used in the long-running tasks, it is unavoidable for the system to periodically transfer huge memory context to a 
remote stable storage. Therefore, the overheads of setting checkpoints and the re-computing time become a critical issue 
which directly impacts the system total overheads. Motivated by these concerns, this paper presents a new model by 
introducing /-checkpoints into the existing two-level checkpoint recovery scheme to deal with the more probable failures 
with the smaller cost and the faster speed. The proposed scheme is independent of the specific failure distribution type and 
can be applied to different failure distribution types. We respectively make analyses between the two-level incremental and 
two-level checkpoint recovery schemes with the Weibull distribution and exponential distribution, both of which fit with the 
actual failure distribution best. The comparison results show that the total overheads of setting checkpoints, the total re- 
computing time and the system total overheads in the two-level incremental checkpoint recovery scheme are all 
significantly smaller than those in the two-level checkpoint recovery scheme. At last, limitations of our study are discussed, 
and at the same time, open questions and possible future work are given. 
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Introduction 

For large scale and long-running applications, system failures 
are inevitable. In the absence of any protective measures, the 
applications must be restarted from the beginning whenever the 
failures occur. This will lead to a large waste of system overheads 
and system resources. Therefore, the system fault-tolerant schemes 
are proposed to solve this problem [1], and one of them is the 
checkpoint recovery technology [2], [3], [4] which is a widely used 
and resultful fault-tolerant measure. During the running process of 
the task, the system saves the task execution states to a reliable 
storage device periodically. Therefore, it can recover itself from 
the last stored state whenever a failure occurs. This avoids the task 
restarting from the beginning, improves the system reliability 
greatly, reduces the system overheads significantly and shortens 
the task completion time. 

In the checkpoint technology, the checkpoint placement 
frequency is important. If the checkpoint interval is too small, 
the overheads created by setting checkpoints will result in large 
system overheads. Conversely, if the checkpoint interval is too 
large, the re-computing time and recovery time will be too long in 
the event of a failure. In this case, the checkpoint recovery scheme 
cannot achieve the desired effects and reduce the system total 
overheads as expected. So, there is a tradeoff between the 
checkpoint placement frequency and the system total overheads. 
The traditional one-level checkpoint recovery scheme [5], [6], [7] 



involves only one type of checkpoint, where each checkpoint is 
designed to tolerate the worst failure scenario. Therefore, the 
overheads of one-level checkpoints are very large. In order to 
reduce the overheads of setting checkpoints and the total system 
overheads, Vaidya [8] presented the two-level recovery scheme. In 
this scheme, two types of checkpoints, namely the ^-checkpoint 
and local checkpoint, are used to deal with the less probable 
failures and the more probable failures, respectively. The 
experimental analyses show that the two-level checkpoint recovery 
scheme can achieve lower system overheads than the one-level 
one. 

When the two-level checkpoint recovery scheme is used to the 
large scale and long-running tasks, the system needs to periodically 
transfer huge data about its running state to a remote reliable 
storage. So, the overheads of setting checkpoints and the re- 
computing time have become a critical issue, which directly 
impacts the total overheads. In order to further reduce the system 
total overheads, we propose a two-level incremental checkpoint 
recovery scheme based on the two-level checkpoint recovery 
technology. The proposed scheme sets three types of checkpoints, 
namely iV-checkpoint, m-checkpoint and /-checkpoint. The N- 
checkpoint is used to deal with the less probable or infrequent 
failures, while the »z-checkpoint and /-checkpoint are used to deal 
with the more probable or frequent failures. The main contribu- 
tions of this paper are listed as follows: (1) we introduce the third 
type of lightweight checkpoint and propose a new two-level 
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incremental checkpoint model; (2) For the two-level incremental 
checkpoint model, we give the global optimal checkpoint 
frequency function and the checkpoint placement algorithm, 
which is independent of the specific failure distribution type; (3) we 
give a method to determine the optimal two-level incremental 
checkpoint placement strategy; (4) we give the placement strategies 
and the related conclusions for the Weibull distribution and 
exponential distribution respectively, and then illustrate the fact 
that the placement algorithm is independent of the specific failure 
distribution type. Experiment results show that compared to the 
two-level checkpoint recovery scheme, the proposed scheme 
significandy reduces the transfers of the storing contents, the 
overheads of setting checkpoints and the re-computing time, and 
thereby reduces the system total overheads. 

The rest of this paper is organized as follows. In section 2, the 
related work is discussed. In section 3, the proposed two-level 
incremental checkpoint recovery scheme is described in details. 
Section 4 takes the Weibull distribution and exponential distribu- 
tion as examples to illustrate how to compute the checkpoint 
placement time instants. In section 5, the experimental analyses 
and performance analyses are presented. In section 6, limitations 
of our study, open questions and possible future work are 
discussed. Finally, Section 7 presents the conclusion. 

Related Work 

As the system scale grows larger and lager, the system reliability 
problem becomes more and more important. Scientists have 
predicted that in future high-performance and large-scale com- 
puting tasks, the most three difficult and growing problems will be 
avoiding, coping with, and recovering from failures [9] . Due to the 
fact that the computing of the tasks become more and more 
complex and the execution time become longer and longer, the 
failure becomes more and more frequent. If there is no fault 
tolerance mechanism, the applications must be re-started from the 
beginning whenever failures occur, which will result in unaccept- 
able performance overheads, especially for long-running applica- 
tions. 

The checkpoint recovery technology is used to tolerate the 
system failures, guarantee the system reliability, and ensure the 
successful completion of the long-running tasks [2] , [3] , [4] . The 
basic idea of the checkpoint recovery technology can be described 
as follows: during the running process of the task, the computation 
state is saved into the storage medium as a checkpoint file every 
once in a while; the file is read to restore to the last stored state 
whenever a system failure occurs, which avoids the task restarting 
from the beginning, reduces the system overheads and guarantees 
the successful completion of the tasks. Checkpoint placement 
strategy is a key issue in checkpoint technology, which determines 
the system overheads. If the checkpoint interval is too small, the 
overheads created by setting checkpoints will result in large system 
overheads. Conversely, if the checkpoint interval is too large, the 
re-computing time and recovery time will be too long in the event 
of a failure, which also results that the checkpoint recovery scheme 
cannot achieve the desired effects and reduce the system total 
overheads as expected. Many researchers have worked on the 
checkpoint placement problem and given a lot of excellent 
solutions. 

In the traditional one-level checkpoint model, Young [5] 
presented an optimal checkpoint and rollback recovery model, 
and obtained the first approximation of the optimal checkpoint 
interval by which the total waste time was minimized. Based on 
the Young's work, Daly [6] has proposed a more accurate cost 
function, which improved the first order approximation to a 



higher order approximation and further reduced the system 
overheads. The main contributions of Young [5] and Daly [6] lie 
in that they took the cost function of the whole execution period 
into account and established a novel derivation principle for the 
optimal checkpoint interval. Unfortunately, in their models, both 
of them assumed that random failures follow a Poisson process 
with a constant failure which cannot adequately represent the 
actual failure characteristics [10]. By deducing the checkpoint 
frequency function which optimizes the expected overhead, Ling 
et al. [7] presented an optimal one-level placement strategy. In this 
way, Ling et al. make the one-level checkpoint recovery scheme 
independent of the specific distribution and can be used for any 
failure distribution. 

Due to the high overheads of traditional one-level checkpoint 
technology, Oliner et al. [1 1] presented a cooperative checkpoint- 
ing technology that can reduce the system overheads and improve 
the system robustness. The cooperative checkpointing schedules 
the basic checkpoint placements following the traditional Young's 
one-level checkpoint model. The difference from Young's model 
lies in the technique that they use to further reduce the checkpoint 
cost, that is to say, based on the risk estimation of system failures, 
some scheduled checkpoints are adaptively skipped. Therefore, the 
performance of their cooperative checkpointing depends on the 
accurate failure prediction, which is challenging [12], [13]. 
Elnozahy et al. [14] and Naksinehaboon et al. [15] have proposed 
the incremental checkpoint model, which sets a series of 
incremental checkpoints between the traditional full checkpoints. 
The incremental checkpoint only save the states that must be used 
during the recovery process or the changed states instead of the 
whole application states, so this model can reduce the overheads of 
setting checkpoints, and then reduce the total system overheads. In 
addition, Paun et al. [16] reduced the overheads of the 
incremental checkpoint scheme by using the optimal checkpoint 
frequency function, which also achieved good results. Although it 
was considered that the scalability problem could be solved well by 
the incremental checkpointing, the incremental checkpointing 
methods are not always practical, because most of the implemen- 
tations need some system-level support in hardware and the 
underlying operating system. Therefore, to avoid the above 
implementation concerns, Agarwal et al. [1 7] presented a purely 
software-based incremental checkpoint technique by using the 
secure hash function. Their scheme does not need system-level 
support, because the computation of the hash function can be 
executed in software. 

The traditional one-level checkpoint recovery scheme can 
reduce the system overheads, but it involves only one type of 
checkpoint, and each checkpoint in the one-level checkpoint 
recovery scheme is designed to tolerate the worst failure scenario. 
Therefore, the overheads of one-level checkpoints are very large. 
In order to reduce the overheads of setting checkpoints and the 
system total overheads, Vaidya [8] presented the classic two-level 
recovery scheme, which sets two types of checkpoints, namely N- 
checkpoint and local checkpoint. The iV-checkpoint and local 
checkpoint are saved in stable storage and local disk respectively 
for different failures, and the overhead of setting an 2V-checkpoint 
is much larger than the local checkpoint. In Vaidya's scheme, the 
failure is divided into permanent failure and transient failure, and 
the permanent failure must be recovered from the iV-checkpoint. 
Vaidya's scheme uses iV-checkpoint with high setting overheads to 
deal with the less probable or infrequent failures and uses local 
checkpoint with low setting overheads to deal with the more 
probable or frequent failures. This makes the common failure be 
processed faster, and then reduces the system total overheads 
compared with the one-level checkpoint scheme. In order to 
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obtain the optimal performance, Vaidya determined the two-level 
checkpoint placement strategy for exponential failure distribution 
by numerical search. However, in reality, the exponential 
distribution fails to give a good overall fit to the failure data, 
and sometimes other distribution types can give a better fit [10], 
[17], such as the Weibull distribution. Hence, a general two-level 
checkpoint placement strategy is needed, which not only can be 
applied to the exponential failure distribution, but also can be 
applied to other distribution types [18]. This problem is still an 
unsolved open problem in this field. Later, multi-level checkpoint- 
ing system was proposed [19], which can be considered as the 
general model of the two-level checkpoint recovery scheme. Multi- 
level checkpointing can potentially deal with the case, that 
different components have different performances, by assigning 
different costs to different types of checkpoints and allowing 
adaptive resiliency between different levels. Generally, lightweight 
checkpoints are used to deal with the more probable or frequent 
failures, while more expensive checkpoints are used to deal with 
the less probable or infrequent failures. 

Since then, researchers paid more and more attentions to the 
checkpoint recovery scheme, and lots of excellent works have been 
done in these years [20]. Hilton et al. [21] studied the method how 
to achieve minimal recovery to reduce the recovery overheads. By 
using the similar idea, Refs. [1], [22], [23], [24] also provided 
complementary techniques to reduce the error probability, thus 
the probability of rollbacks was reduced. Li et al. [25] proposed a 
fast restart mechanism for checkpoint/recovery protocols in 
networked environments, which is a complementary technique 
to the multi-level checkpointing system. Cores et al. [26] and 
Akkary et al. [27] studied the scalability of the checkpoint recovery 
scheme, and proposed techniques to reduce the recovery 
overheads when the scalability of the application grows. To 
further reduce the overheads of the checkpoint recovery process, 
Cores et al. [28] carry out the study on how to reduce the size of 
the checkpoint files. Also, for large-scale distributed systems, Wei 
et al. [29] studied the use of process clones towards localizing 
recovery, and they proved that their protocol can result in 
localized recovery involving a single group when clones are 
employed. Recently, diskless checkpoint has been introduced as a 
solution to avoid the I/O bottleneck of disk-based checkpoint [30], 
[31]. However, although this method works well, the encoding 
time, the dedicated resources and the memory overhead imposed 
by diskless checkpoint are significant obstacles against its adoption. 
Checkpoint schemes implemented on practical application systems 
have also been researched. Rusu et al. [32] proposed two different 
failure recovery schemes, which are based on the coordinated 
checkpointing and the uncoordinated checkpointing, respectively. 
Then, the performance comparison of these two schemes is made 
in effectiveness and overheads, and it shows that the first method is 
better than the second one due to its lower failure rates and smaller 
overheads. Khunteta et al. [33] presented the review of the 
algorithms, which have been reported for checkpointing ap- 
proaches in mobile ad hoc network. Also, Rodriguez et al. [34] 
focused on the performance evaluation and studied the factors that 
impact the checkpoint recovery scheme, and pointed out 
meaningful conclusions about the state-of-the-art and future 
research trends in the rollback-recovery field. Rehman el al. 
[35] thought that for the system reliability, both software and 
hardware abstraction layers of a system should be involved and 
contribute its particular advantages towards highly-reliable hard- 
ware/ software system, and at the same time they proposed a novel 
compilation technique for reliability-aware software transforma- 
tions and instruction-level vulnerability estimation method. Henkel 
et al. [36] introduces the most prominent reliability concerns from 



today's points of view and roughly recapitulates the progress in the 
community so far, which is very instructional. 

Method 

In order to facilitate the description of the proposed scheme, 
some notations used frequendy in this paper are summarized in 
Table 1. 

1 Model of Two-level Checkpoint Incremental 
Checkpoint Scheme 

The two-level incremental checkpoint model is shown in Fig. 1 . 
The model contains three types of checkpoints, namely N- 
checkpoint, m-checkpoint and 2-checkpoint. We describe the 
model in detail in the following. 

In our model, the application sets ^-checkpoints periodically, sets 
an m-checkpoint after n z-checkpoints periodically, and sets an N- 
checkpoint after in Hi-checkpoints periodically. The interval 
between two neighboring A-checkpoints is called a segment. 
The first checkpoint or the beginning checkpoint after a failure is 
always an ./V-checkpoint, which saves the total states in the remote 
stable storage. The remote stable storage is assumed to be always 
failure-free, so we can recover from the Af-checkpoint no matter 
what type of failures occur. The m-checkpoint saves the 
application total states in the local disk. The overhead for saving 
application states in the local disk is much less than that in the 
remote stable storage. And, the recovery overhead from the local 
disk is also less than that from the remote stable storage. So when 
the transient failure occurs, we can recover from the m-checkpoint 
to reduce the system overheads. The j-checkpoint is also saved in 
the local disk, but it only saves the application states that have 
changed since the previous checkpoint. So the overheads of both 
setting z-checkpoint and recovering from z-checkpoint are quite 
low, which reduces the re-computing time significandy after the 
failure. We assume the overhead of setting an ,/V-checkpoint, m- 



Table 1. Notation. 





Notation 


Meaning 


o„ 


Overhead of setting an W-checkpoint 


o m 


Overhead of setting an m-checkpoint 


Oj Overhead of setting an /-checkpoint 




Recovery cost of an M-checkpoint 


^"re-compute 1 


Re-computing time when a permanent failure occurs 


7"re-compute2 


Re-computing time when a transient failure occurs 


R m 


Recovery cost of an m-checkpoint 


Ri 


Recovery cost of an /-checkpoint 


50 


Checkpoint frequency function 


Pn 


The probability that a permanent failure occurs 


m 


Number of m-checkpoints between two neighboring N- 
checkpoints 


n 


Number of /-checkpoints between two neighboring m- 
checkpoints 


id) 


The checkpoint interval between t, and t, +1 


k 


The re-computing time coefficient 


T 


The time that a failure occurs 


fO 


Probability Density Function 



doi:1 0.1 371 /joumal.pone.01 04591 .t001 



PLOS ONE | www.plosone.org 



3 



August 2014 | Volume 9 | Issue 8 | e104591 



(ftth Failure 



Two-Level Incremental Checkpoint Recovery Scheme 

(i"+l)th Failure 
Recovery 




7Ve-compute2 



^recompute 



A'-checkpoint 



m-checkpoint 



/-checkpoint 



Rollback 



Figure 1. The two-level incremental checkpoint model. 
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checkpoint and z'-checkpoint is O n , O m , Oi, respectively, and they 
meet O n >O m »0,. 

We divide the failures into permanent failure that occurs 
infrequendy and transient failure that occurs frequently. The 
permanent failure is the one with low probability, and the transient 
failure is the one with high probability. When a permanent failure 
occurs, the application must recover from the A'-checkpoint. 
Conversely, when a transient failure occurs, the application only 
need recover from the last A'-checkpoint or m-checkpoint. If there 
are i-checkpoints after the last A'-checkpoint or m-checkpoint, the 
application needs to read the z-checkpoint no matter from which 
checkpoint to recover. That is, when a failure occurs, the 
application can recover by the last A'-checkpoint and several 
related i-checkpoints or the last m-checkpoint and several related i- 
checkpoints. The z'-checkpoint can only be used with the A'- 
checkpoint or the m-checkpoint in recovering the application and 
the sole z'-checkpoint cannot recover any application. Thus, 
although there are three types of checkpoints, the two-level 
incremental checkpoint recovery scheme is not a three-level one as 
a particular case of [19]. In our paper, the overhead of recovering 
from A'-checkpoint, m-checkpoint and z'-checkpoint is R„, R m , R„ 
respectively, and they meet R n >R m »Rj. 

Similar to [15], [16], [18], the following assumptions are also 
made in this paper. 

1 . The long-running application can be interrupted by a series of 
unexpected failures, and the failure follows the probability 
density function (PDF) f(t). And the failures are independent of 
each other. 

2. The failure can be detected by a monitoring mechanism once 
the failure occurs. 

3. The first checkpoint or the beginning checkpoint after a failure 
is always an A'-checkpoint. 

4. Because the process state is changing with the time, the size of 
the checkpoint file is constantly changing. In order to simplify 
the calculation, the overheads of setting checkpoints 0,„ 0,„, 0, 
and the recovery cost of checkpoints are assumed to be constant. 
In practice, we use the average value of each parameter. 

5. The number of m-checkpoint between two neighboring N- 
checkpoints is m, and the number of i-checkpoint between two 



neighboring m-checkpoints is n, and the m and n are both 
constant if no failures occur. 
6. The failure never occurs during the re-computing and recovery 
time. 

What we should point out is that although we adopt the similar 
assumptions used in [15], [16], [18], to simplify the problem, that 
is to say, assume that m and n are constant, they vary uncertainly 
and also affect the system performance. If m becomes larger, the 
overheads of setting checkpoints will become larger. If m becomes 
smaller, the overheads of re-computing time when a permanent 
failure occurs will become smaller. The value of n has the similar 
affect on system performance. 

2 System Total overhead Function 

The system total overhead T total ovc . rhead in the long-running 
application is consists of three parts [6], [16], [18]: the overhead of 
setting checkpoints 7" s( , t _ (:lleckp()int , the re-computing time in the 
event of failures T rt ._ compute and the overhead of recovering from 



the failures T„ 



That 



i' < .'v. ; ■ 1 IS, ^total_overhead ^ set checkpoint"^ 

Next, we deduce the system total overhead 
function specifically in (0, T). We assume that the overheads 
corresponding to those three parts are T sa ch ,, ckpoint (T), T„_„, mpute 
(X) and T Tl , cov ^, (T) respectively in (0,7). 

2.1 Overhead of Setting Checkpoints. Due to the check- 
point placement procedure is a renewal process [37], therefore, the 
new cycle starts whenever a failure occurs. In order to obtain the 
optimal placement strategy of the two-level incremental check- 
point recovery scheme, we introduce the checkpoint frequency 
function. Here we first give the definition of the checkpoint 
frequency function, and then we deduce the overhead function of 
setting checkpoints of the two-level incremental checkpoint 
recovery scheme. 

Definition 1. Let s(t) be checkpoint frequency function, then. 

U+i 

s(t)dt=lj>0 (1) 

H 

where t f (i.= 1,2,...) is the zth checkpoint placement, and t 0 — 0. 



T 



T +T 

1 re-compute 1 1 ' 
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We assume T is the time when a failure occurs. According to (1), 
the number of iV-checkpoints, m-checkpoints and z-checkpoints 

in (0, T) are approximated by- 



s(ij)dr\, 



T (m+l)n 

s(n)dn and - — — 
o (w+l)(n+l) 



s{r\)dr}, respective- 



m 

ly. So, in (0, T), the total overheads of setting checkpoints are. 



^set_checkpoint(^) z 



(m+l)(n + l)J 0 



mO m T , (m+iynO, T 

(w+l)(n+l)J 0 (»i + l)(«+l).)o 



0„+mO,„ + (m + l)«Q, 
(« + !)(«+ 1) 



(2) 



s{r\)dr\ 



Tre— compute2 ~ /rrrx ? ^G(0, 1 ) 

s(r) 



(3) 



When a permanent failure occurs, the re-computing time 
Tre-computei is the interval from last JV-checkpoint to the failure 



time. T re _ c 



[impute 1 



can be expressed as (4) 



^re— computel ' 



k + (m+\)(n+l)-l 



(4) 



In summary, the total re-computing time can be expressed as. 



( re— compute 



2.2 The Re-computing time. The re-computing time is a 
period between the last recovery checkpoint and the present 
failure. For better dealing with the different scenarios, we divide 
the failures into two types. One is the permanent failure that is less 
probable and occurs infrequently, and the other is the transient 
failure that is more probable and occurs frequently. When a 
permanent failure occurs, the application must recover from the 
Af-checkpoint. Conversely, when a transient failure occurs, the 
application can recover from the last A^-checkpoint or m- 
checkpoint. If there are i-checkpoints after the last iV-checkpoint 
or wz-checkpoint, the application also needs to read the i- 
checkpoints no matter from which checkpoint to recover. We 
assume the probability of the permanent failure is^ H . As shown in 
Fig. 1 , when a permanent failure occurs, the re-computing time is 
Tre-computels while when a transient failure occurs, the re- 
computing time is r„_ ( . ompute2 . 

When a transient failure occurs, the re-computing time 
fre-compute2 is the interval from last recovery checkpoint to the where R' 



(T)=p„T I 



e— computel 



re— compute2 



failure time. The relationship between T re . c 



ompute2 



and the 



checkpoint interval is shown in Fig. 2. r re . com p ute2 can be 
expressed as (3) [18], where k is a re-computing time coefficient 
variable between (0, 1). 



Failure 



T 

re-compute2 

< ^ 



T 



s(T) 



Figure 2. The relationship between 7~ rE 
checkpoint interval. 
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pute2 and the 



k+(m + l)(n+l)-l , k ( 5 ) 

S(T) S(T) 



2.3 Overhead of Recovering From Failures. The overhead 
of recovering from failures is the time consumed from reading the 
information from checkpoint to returning to the state that the last 
checkpoint saved after a failure occurs. According to assumption 4, 
the recovery cost of iV-checkpoint, m-checkpoint and 2-checkpoint, 
namely R n , R m , Ri, axe assumed to be constant. We assume the 
probability that a failure is permanent is p n , so the probability that 
a failure is transient is (l-/? n ). Then, the overhead of recovering 
from failures can be expressed as. 



y(T)x>p„R„ + (l-p n )(R'+nR i ) 



(6) 



R m , m>\ 
R„, m = 0 



3 Optimal Checkpoint Frequency Function 

Here, we first give the definition of the system total checkpoint 
overheads function, and then we deduce the global optimal 
checkpoint frequency function through the total checkpoint 
overheads function. 

Definition 2. The total checkpoint overheads can be 
expressed as a function about the failure time T, which can be 
expressed as. 



^total_overhead(^) - 



O n +mO m + (m+l)nOi 
(w+l)(«+l) 

k + (m+\)(n+\)-\ 

+Pn — 



s{t])dr] 



s(T) 

p n R„ + (l-p n )(R'+nR i ) 



+ (1-P„) 



k_ (7) 



where R' 



R m , m>\ 
R n , m = 0' 

The time when a failure occurs is random during the application 
execution, so whenever a failure occurs, the application will recover 
from the corresponding checkpoint and place the new checkpoints, 
and the task will be also restarted after failures. Therefore, checkpoint 
placement process is a renewal reward process. We define Wj as the 
total overheads from the starting or restarting point to the ith failure. 
The total overheads of the long-running application can be expressed 
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asV, W„ where ; is the number of failures. According to the Based 011 the theorem of calculus of variations [15], if the 

, „ , , _„„ . . integral in (12) has a minimum value, (12) must satisfy Euler- 

theorem ot a renewal reward process \5 7, we obtain. . . ,,' 

Lagrange in (15) 

llm ELi^ = ^0 ?4(fe = 0 (13) 

t^co t E(T X ) K ' dy drdy 1 ' v ; 



T\ is the time when the first failure occurs. The left hand side of 
the above equation represents the total average overheads, and it is 
a function of the average overheads in the first circle, E{W\). The 
Equation (8) suggests that minimizing the total average overheads 
is equivalent to minimizing the overheads from the starting point 
to the first failure. We define f(t) as the probability density function 
of the failure, then the average checkpoint overhead in the first 
circle is described as follows: 



-£"( ^total .overhead ) 



= f \ 0 " +r ' 

Jo 1 0 



0„ +mO m + (m+ \)nOi 



+Pn 



m+l)(n+l) 
k+(m+l)(n + l)-l 



s{t})dr] 



"(I ~Pn) 



(9) 



s(T) ^ '"'s(T) 

+ Pn R n + (l-p n )(R'+nR i )]f(t)dt 



we can 



get the optimal 



By solving the minimum of (9), 
checkpoint frequency function s(f/) opt . 

The conclusion of the optimal checkpoint frequency function is 
shown as Theorem 1. 

Theorem 1. The optimal checkpoint frequency function that 
minimizes the global average checkpoint overhead can be expressed 
as 



s(t) 



opt 



(10) 



(m + l)(n+l)\p„(mn+m+n+k) + (l-p„)k] / f(t) 



O n +mO m +(m+ X)nOj 



\-F(t) 



Proof. Let y(t)-- 
obtain 



s(rf)dri. By substituting it into (9), we 



Taking the partial derivative of g with respect to y and y' 
respectively, we have 



8g O n + mO m + (m+\)nO, 



dy 



(m+l)(«+l) 



fit) 



dg k+p n [(m+l)(n+l)-l] 



dy 



[/(Of 



fit) 



(14) 



(15) 



By substituting (14) and (15) into (13) and integrating on both 
sides of (15) on the interval (0, t), we obtain 



0„ + mO m + ( m+ 1) nOi 
( m+ 1) ( n+ 1) 
k + p„ [( m+ 1) ( n+ 1)- 1] 



+ 



y'it) 



F( t) (16) 

fit) = c 



where C is a constant. Because the function y(t) satisfies the 
conditions in the following. 



y(Q) = 0 
lhn^=0 



Applying the second condition in (17) to (16), we obtain. 

c _ 0„+mO m +(m+l)nOi 
(w+l)(«+l) 



(17) 



(18) 



E(T t 



total_ovcrhcadJ 



,0„+mO m + (m+\)nOi 



o (w+l)(n+l) 
k + (m+\)(n+l)-l 



yit) 



+Pn 



y'it) 

+ Pn R n + (l-p n )(R'+nR i )}fit)dt 



By substituting (18) into (16) we can get. 



y'it) = 



(m+l)(n+l)\p„(mn + m + n + k) + (l-p„)k] / f(t) 



O n +mO m +(m+ X)nOj 



I -Fit) 
(19) 



We let 



, p„+mO m + (m+l)nOj 

g(y,y ,t) = y(t) 

(m+ 1)(m+ 1) 



k+(m+l)jn+l)-l k (12) 

+p " y'it) +{ Pn) y'it) 

+p„R„ + (l - Pn )(R' + nRi)\f(t) 



Because s(t) =y'(t), the optimal checkpoint frequency function 
that minimizes global average checkpoint overhead can be 
expressed as (10). 

After obtaining the global optimal checkpoint frequency 
function, the checkpoint number m, n and the average checkpoint 
overhead £(T tota i_ over h e ad) of me two-level incremental checkpoint 
placement strategy are determined. We can compute the optimal 
checkpoint placement time through the optimal checkpoint 
frequency function. If k and the minimum of m, n are obtained, 
the checkpoint placement strategy is determined finally. 
Before determining the checkpoint placement time we should 
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give the method to estimate the expected re-computing time 
coefficient k. 

4 Estimation of Expected Re-computing Time 
Coefficient^ 

As shown in Fig. 2, we can use the re-computing time 
7 n r< .-<< ) mpute2 an d the checkpoint interval to estimate the re- 
computing time coefficient k. In addition, it is obvious that the 
re-computing time T re _ CO mpute2 is a random variable depending on 
the time when the failure occurs. Therefore, if we know the 
distribution of the time between failures, then T re . c „ mputt , 2 can be 
estimated, and then k also can be estimated. 

Definition 3. The re-computing time coefficient k is the ratio 
between the re-computing time T rt ._ compute2 and the checkpoint 
interval in which a failure occurs. So, the re-computing time 
coefficient k can be expressed as. 



' re-compute2 

U+i — u 



T-U 
ti+i — u 



(20) 



where T is the time when the failure occurs. In order to estimate k, 
we first need the following definition to estimate the expected re- 
computing time T re _ compute2 for each checkpoint interval. 

Definition 4. Excess life is a random variable, Z>0, which 
denotes system survival until time t+Z given that it survives till time 
I. We respectively denote the cumulative distribution function 
(CDF), the probability density function (PDF) and the expected 
value of the excess life Z as follows. 



F(t + z\t) = P(T<t + z\T>t) 



f(t + z\t)- 



E{z)- 



dF(t+z\t) 
dz 



zf(t+z\t)dz 



(21) 



(22) 



(23) 



In our model, t t is the rth checkpoint placement. The re- 
computing time T re _ (:ompute2 during the interval is a random 
variable such that its value is in the interval (0/j+r^). According to 
Definition 4, the expected value of re-computing time T re _ (:om p Ute2 
can be expressed as. 



E(T[ 



re-compute2 



ti +l -'zf( tl +z\ ti )dz 
Jo'' +1 ~'' f(t,+z\t,)dz 



(24) 



Therefore, the expected k of the ith checkpoint interval, kj, is. 

(25) 



^(-^re-compute2) 



U+i — U 



Hence, the expected re-computing time coefficient is. 



EL** 



E 



(26) 



where P, = P(t,<T<ti + \\T>tij and N is the number of the 
checkpoints. The method to estimate the re-computing time 
coefficient k is given by (26), therefore the minimum of in and n is 
obtained, and then the two-level incremental checkpoint placement 
strategy is determined finally. Next, we give a method to determine 
the two-level incremental checkpoint placement strategy. 

5 Determine Two-level Incremental Checkpoint 
Placement Strategy 

From (10) we can see that the re-computing time coefficient k, 
the number of m-checkpoint m and the number of 2-checkpoint n 
are closely related, and therefore, in practice, we have to find k, m 
and n at the same time. In the following we give the method to 
estimate k and the minimum of m and n. 

Algorithm 1. Algorithm to estimate k and the minimum of m, 

n: 

Step 1: Initialize the parameter k, m and n. Let £j n j = 0.5, 
JB in ; = 1 , ni n i = 1 . (when 7n — 0 or n = 0 the two-level incremental 
checkpoint recovery scheme degenerates to the two-level check- 
point recovery scheme, so the value m = 0 or n = 0 has no 
meanings.). 

Step 2: Input k ini , m mi and n lni . Calculate the optimal checkpoint 
frequency function using (10). Output s(t) opt . 

Step 3: Input s(t) opt . Calculate the minimum of m and n using 
(9). Output m min and w min . 

Step 4: Input k^i, w m in and n min . Calculate the checkpoint 
placement time relating to k mi , m min and n min using (1) and (10). 
Output l\, t 2 ,..., t N . 

Step 5: Input t lt t 2 ,..., t N . Calculate the expected re-computing 
time coefficient using the (24), (25) and (26). Output k. 

Step 6: If k m i = k, set k = k m \=k, the algorithm ends. 
Otherwise, set k m i=k, and return to step 2. 

When h and the minimum of m and n are determined, the 
checkpoint placement time can be calculated using (1), and then 
the two-level incremental checkpoint placement strategy is 
determined finally. 

We can see that the above derivation processes about the 
optimal frequency function s(t) opt , k and the minimum of m and n 
are not specific for a certain kind of failure distribution, but only 
involve the abstract form of distribution functions f(t). Therefore, 
the checkpoint placement method does not depend on specific 
failure distribution types, and the method can be applied to 
different failure distribution types, such as the Weibull distribution, 
exponential distribution and so on. When the method is applied to 
specific failure distribution type, we only need to replace fit) with 
the specific failure distribution. 

Examples 

Because the two-level incremental checkpoint recovery scheme 
proposed in this paper is independent of the failure distribution 
type, it is applicable to different failure distribution types. And 
thus, we can calculate the checkpoint placement time under any 
failure distribution type. Although the failure distribution types are 
various, the methods to calculate the checkpoint placement time 
for different distribution are similar. The Weibull distribution and 
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exponential distribution fit the actual failure features best, and 
therefore, we take them as examples to illustrate how to calculate 
the two-level incremental checkpoint placement time, and then to 
determine the checkpoint placement strategy. 

Whether the failure follows the Weibull distribution or 
exponential distribution, when we want to determine the 
checkpoint placement time using the algorithm mentioned in 
Section 3.4, we first need to calculate the minimum of m and n, 
namely m m ; n and n m ; n , respectively. About calculating t\, t 2 ,..., In 
in step 4, we give the following conclusions. 

In order to calculate the checkpoint placement time better, we 
first give CDF and PDF of the Weibull and exponential 
distribution. The CDF and PDF of the Weibull distribution are 

W)= 1 -e-' P/ *"and / weib uu(0 = I *){t j *f ~ x e~ > a? , re- 
spectively, where a is the scale parameter and /? is the shape 
parameter. The CDF and PDF of the exponential distribution are 
-F e xp(0 = 1 — e~ A 'and f eX p(t) = oce ~ h , respectively, where X is the 
rate parameter. 

Theorem 2. Let l t (i = 1,2,...) be the checkpoint placement time, 
such that t 0 = 0. When the failure distribution follows the Weibull 
distribution, ti can be expressed as 



U = -. — i 

A-exp 



(32) 



where A ap = 



(m+ l)(n+l)\p n (mn+m+n+k) + (l —p„)k] ^ 
O n + mO m + (m+l)nOi 
Proof. When the failure distribution follows the exponential 
distribution, by substituting the CDF and PDF of the exponential 
distribution into (10), we obtain the optimal checkpoint frequency 
function for exponential distribution. 



40, 



opt — ^exp 



(33) 



where ^ exp = 



(m+ l)(n+l)\p n (mn+m+n+k) + (l —p„)k] 



0„+mO m + (m+l)nOi 
According to Definition 1, we have 



\Tk. 



f'i+l f'i+l 

| i(0opt*= A exp dt=A exp (t i -t i . l ) = l (34) 

J ti J ti 



(27) 



Therefore, 



where A wb ]-- 



l( m+l)(n+l)\p„(mn + m + n + k) + (l-p„)k] p_ 
O n +mO m + (m+l)nOi V a? • 

Proof. When the failure distribution follows the Weibull 
distribution, by substituting the CDF and PDF of the Weibull 
distribution into (10), we obtain the optimal checkpoint frequency 
function for the Weibull distribution. 



■+U-1 



By induction, we obtain. 



ti=-7—l 



(35) 



(36) 



*(0opt = y4 wbrf 2 



(28) 



(m+l)(n+l)\p n (mn+m+n+k) + (l-p n )k] [J 



where A kU = 

O n +mO m + (m+l)nOj 
According to Definition 1 , we have. 



f'i+l f'i+l 
| s(t) opt dt = 



«-l 1A v, t±l 11+1 

AwbV t L ^dt= 2 1 ^(t i 2 -t^)=\ (29) 



Therefore, 



where i — 0,1,2,..., and to = 0. 

Using Theorem 2 and Theorem 3, we can calculate the 
checkpoint placement time for the Weibull distribution and 
exponential distribution. Next, we analyze the changes 
between two neighboring checkpoint intervals when the failure 
follows the Weibull distribution and exponential distribution 
respectively. 

Theorem 4. Let I(i) be the checkpoint interval between ti and 
t, +1 . When the failure follows the Weibull distribution, if the shape 
parameter fi>\, I(i) is decreasing, and if the shape parameter 
/?< 1, I(i) is increasing. 

Proof. According to (27), when the failure follows the Weibull 
distribution, we have 



B+l £ti 



' 2A w u 



By induction, we obtain. 



2 
<+l 



(30) 



ft+1 2 fl+1 2 

7(0 = t t+l - 1, = [(/+ 1)^]^ - li^r-}^ 

ZA^u ^wbl 

^lii^.^.^+D^-^/T+T- 1 )] 



(37) 



'2A 



wbl 



)8+l 



(31) 



where i = 0,1,2,..., and to = 0. 

Theorem 3. Let tfi = 1,2,...) be the checkpoint placement time, 
such that to = 0. When the failure distribution follows the 
exponential distribution, i, can be expressed as 




(m+l)(n+l)\p„(mn+m+n+k) + (l —p„)k] 
O n + mO,„ + (m+ \)nOi 
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„ 25 



u=0.05 
u=0.25 
u=0.45 
u=0.65 
u=0.85 
u=0.95 



H — EH — EH — EH — EH — EH — EH — EH — EH — EH 
0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95 

The probability Pn 

Figure 3. The relationship between optimal number of i- 
checkpoints and p n under different u= O/O^ 
doi:1 0.1 371 /journal.pone.01 04591 .g003 



Solving the first derivation of 1(f), we have 



7'(0 = ^70 =^[(i+ l)4i - ff + l ]( 



2A 



0+1 



■[('"+ 1) 



H+l 



1 2 
) H + 1 



(38) 



From (38), we can see that if /?> 1, l'(i)<0, that is to say, the 
checkpoint interval 1(f) is decreasing, and if fi < 1 , the checkpoint 
interval I(i) is increasing. 

Note: If P=l, the Weibull distribution turns into exponential 
distribution, the related conclusions are shown in the following. 

Theorem 5. Let I(i) be the checkpoint interval between t { and 
tj + i. When the failure follows the exponential distribution, I(i) is 
constant unrelated with the i. 

Proof. According to (27) when the failure follows the 
exponential distribution, we have. 



c 25 



20 



.2 

O 

£ 



£ 5 



73 
S 



h — i 



£ ]□[]□ 



-B— Pn=0.05 

— Pn=0.15 

— Pn=0.25 
H Pn=0.35 

I — Pn=0.45 

— Pn=0.5 



9 — f — 9- 



[]□[]□ [] 



0.05 0.15 0.25 0.35 0.45 0.55 0.65 0.75 0.85 0.95 1 
The checkpoint ratio u=Oi/Om 

Figure 4. The relationship between optimal number of 
/-checkpoints and u under different p„. 

doi:1 0.1 371 /journal.pone.01 04591 .g004 



I(i) = t i+l -t r - 



(39) 



where 



(m+ l)(n+ \)\]}„(mn + m + n + k) + (l —p„)k] 
0„ + mO m + (m + \)nOi 



y/l 



From (38), we can see that I(i) is constant unrelated with the i. 

Here, we only take the Weibull distribution and exponential 
distribution as examples to illustrate how to calculate the two-level 
incremental checkpoint placement time, and analyze the nature of 
the checkpoint interval for the Weibull distribution and exponen- 
tial distribution. When the failure follows other distribution types, 
the checkpoint placement time also can be calculated by (1). 

After the checkpoint placement time for the Weibull distribution 
and exponential distribution are calculated by (27) and (32), the 
expected re-computing time coefficient k can be calculated using 
the step 5 of the algorithm in Section 3.4, and then through the 
judgment of step 6, the ft, m m ; n , w min and the checkpoint placement 
sequences t\, t 2 ,..., t N can be determined. In this way the 
checkpoint placement strategy for the Weibull and exponential 
distribution is determined finally. 

Performance Analyses 

For the two-level checkpoint recovery scheme which is used to 
the large scale and long-running tasks, huge memory context must 
be transferred through the network and saved in a reliable storage. 
So the overheads of setting checkpoints and the re-computing time 
directly impact the system total overheads and the system 
performance. In order to further reduce the overheads of setting 
checkpoints, the re-computing time, the system total overheads, 
and make the scheme be applied to any type of failure distribution, 
we present a two-level incremental checkpoint recovery scheme 
based on the ideal that using checkpoint with high setting 
overheads to deal with the less probable or infrequent failures and 
using checkpoint with low setting overheads to deal with the more 
probable or frequent failures. 

However, for the traditional one-level checkpoint recovery 
scheme, the two-level checkpoint recovery scheme and our two- 
level incremental checkpoint recovery scheme, they all have their 
own merits or demerits. In the one-level checkpoint recovery 
scheme, either full or incremental checkpoint, only one kind of 
failure has been taken into account, and thus its placement strategy 
is simple. When a failure occurs, what we should do is just to 
recover the system for the latest checkpoint. However, this method 
cannot deal with different failures in different way, which 
forecloses the aim of the optimal performance. Compared with 
the one-level checkpoint recovery scheme, the two-level scheme 
can deal with the two different failures and achieve the more 
optimal performance. But it makes the placement strategy more 
difficult, because two kinds of checkpoints should be considered. 
Besides, it cannot distinguish the failures with different frequency. 
Our two-level incremental checkpoint recovery scheme adopted 
three kinds of checkpoints to deal with the above failures to 
achieve the optimal performance. But, the larger the number of 
kinds of checkpoints, the more difficult the checkpoint placement 
strategy becomes. In a word, compared to the two-level checkpoint 
recovery scheme, the proposed scheme significandy reduces the 
overheads of setting checkpoints and the re-computing time, and 
thereby reduces the system total overheads. In addition, this paper 
deduces the global optimal checkpoint overheads function and 
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Table 2. The parameters of two-level incremental checkpoint recovery scheme. 





Parameter 


Pn 


^ini 


o„ 


o m 


o, 




Rm 


Ri 


Value 


0.05 


0.5 


1.0 


0.1 


0.005 


1.0 


0.1 


0.005 



doi:1 0.1 371 /journal.pone.01 04591 .t002 



solves the problem that how to determine the optimal checkpoint 
placement strategy through. 

To evaluate performance of our scheme, in this section, we first 
discuss the factors affecting the number of z-checkpoints between 
two neighboring iV-checkpoints, and then analyze and show the 
advantages of the two-level incremental checkpoint recovery 
scheme compared with the two-level checkpoint recovery scheme. 

1 Factors Affecting Optimal Number of Checkpoint in 
One Segment 

The number of i-checkpoints between two neighboring N- 
checkpoints is the key factor that determines the checkpoint 
placement and affects the re-computing time and the system total 



overheads. If the number of i-checkpoints is obtained, with the 
value of the parameter^, we can obtain the related number of m- 
checkpoints. In this case, the checkpoint placement strategy is 
determinate. So, we will mainly give a mathematical analysis and 
conclusions about the optimal number of i-checkpoints between 
two neighboring iV-checkpoints in our checkpoint placement 
strategy. 

According to Section 3.3, we can determine the optimal number 
of ^-checkpoints between two neighboring iV-checkpoints with 
specific parameters. The tendency of the optimal number of i- 
checkpoints will be shown visually by several groups of examples in 
Fig. 3. The parameter /;„ is the probability of recovering from an 
2V-checkpoint, and u = OJO, n is the ratio of the overheads of 
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- Two-level [lKiemcnial Kecovery Scheme 
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(b) The Number of Failures (Times) 
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(d) The Number of Failures (Times) 



Figure 5. The comparison results between two-level incremental checkpoint recovery scheme and two-level checkpoint recovery 
scheme for the Weibull distribution, (a) The relationship between the total overheads of setting checkpoints and the number of failures; (b) The 
relationship between the total re-computing time and the number of failures; (c) The relationship between the total overheads of recovering from 
the failures and the number of failures; (d) The relationship between the system total overheads and the number of failures; Note: Time Unit depends 
on parameters in practical implementation, such as the practical value of O m , so it is not given here, which is similar to the case in Figs. 6-10. 
doi:1 0.1 371 /journal.pone.01 04591 .g005 
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Figure 6. The comparison results between two-level incremental checkpoint recovery scheme and two-level checkpoint recovery 
scheme for exponential distribution, (a) The relationship between the total overheads of setting checkpoints and the number of failures; (b) The 
relationship between the total re-computing time and the number of failures; (c) The relationship between the total overheads of recovering from 
the failures and the number of failures; (d) The relationship between the system total overheads and the number of failures. 
doi:1 0.1 371 /journal.pone.01 04591 .g006 



setting z-checkpoint and m-checkpoint. The range of u is (0,1). 
Note that we do not care the occasions when u = 0 and u — 1 . 
When u = 0, it means the overhead of z-checkpoint is 0; when 
u — 1 , it means the overhead of z-checkpoint is equivalent to the 
overhead of zzz-checkpoint. In both cases, the two-level incremental 
checkpoint recovery scheme will degenerate to the two-level 
checkpoint recovery scheme. The range of p n is (0, 1). The case, 
p n = 0 or p n = 1, means only permanent failure or only transient 
failure occurs in the system, which does not match the actual 
situations, so we do not consider these two cases either. In fact, 
during the practical running, the checkpoint recovery scheme 
must be affected by other factors, such as network throughout and 
I/O interaction [30], [31]. However, the existing schemes 
[15,16,18] just considered the main factors that affect the system 
performance basically and their experiments ignore the affection 
of them. Now, there have been some other researches that study 
their affection on the checkpoint recovery scheme, which has been 
considered as another new and independent research topic. Also in 
our experiment, to achieve the performance comparison between 
the existing schemes and ours in the same circumstance, we also 
ignore these factors like [15,16,18]. Studies of the affection of these 
factors is not our contribution of this paper, and may be one of our 
further works. 



Fig. 3 shows that the optimal number of z-checkpoints between 
two neighboring iV-checkpoints varies with the parameter p n for a 
given value u according to our placement strategy. As shown in 
Fig. 3, for a given value u, the greater probability p„ is, the smaller 
the optimal number of z-checkpoints is. When p n is large enough, 
no z-checkpoint is taken. For example when u = 0/O m = 0.05 and 
p„ = 0.5, there is no z-checkpoint in the proposed scheme and there 
only exist iV-checkpoint and m-checkpoint, and now the two-level 
incremental checkpoint recovery scheme degenerates to the two- 
level checkpoint recovery scheme. This is because when the 
probability p n of the permanent failure rises, which means that the 
permanent failure occurs frequently. And in this case, the system 
only can recover from the 2V-checkpoint, which results in that the 
placement of z-checkpoint becomes less and less. In order to 
reduce the system overheads, the z-checkpoint should be set less 
and less until it disappears. 

The following Fig. 4 shows that the optimal number of i- 
checkpoints between two neighboring ^-checkpoints varies with 
the checkpoint ratio u for a given value p n according to our 
placement strategy. 

As shown in Fig. 4, the comparison results of several curves of 
different value p n show that that the greater checkpoint ratio u is, 
the smaller the optimal number of z-checkpoint for a given 
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Figure 7. The comparison results between two-level incremental checkpoint recovery scheme and two-level checkpoint recovery 
scheme for the Weibull distribution, (a) The relationship between the total overheads of setting checkpoints and the completion time; (b) The 
relationship between the total re-computing time and the completion time; (c) The relationship between the total overheads of recovering from the 
failures and the completion time; (d) The relationship between the system total overheads and the completion time. 
doi:1 0.1 371 /journal.pone.01 04591 .g007 



parameter p n is. Especially when the value u is large enough, no i- 
checkpoint is scheduled. For example, when p n = 0.25 and u — OJ 
O m — 0.9, there will be no ^-checkpoints and there only exist the N- 
checkpoint and m-checkpoint, and now the two-level incremental 
checkpoint recovery scheme degenerates to the two-level check- 
point recovery scheme. This is because when the value u becomes 
larger and larger, which results that the overheads of setting i- 
checkpoints become larger and larger and gradually approach the 
overheads of setting m-checkpoint. The above situation results in 
that the contents of z -checkpoint are approximately equal to those 
of m-checkpoint, so the i-checkpoint gradually changes to m- 
checkpoint until it disappears. 

2 Performance Comparisons 

In this section, we use three groups of experiments to analyze 
the advantages of two-level incremental checkpoint recovery 
scheme compared to the two-level checkpoint recovery scheme. 
All these three groups of experiments are carried out under the 
Weibull distribution and exponential distribution. The first group 
of experiments compares the total overheads of setting check- 
points, the total re-computing time, the total overheads of 
recovering from failures, the system total overheads with the 
numbers of the failure between the two-level increment checkpoint 



recovery scheme and the classical two-level checkpoint recovery 
scheme [8] . The second group of experiments compares the total 
overheads of setting checkpoints, the total re-computing time, the 
total overheads of recovering from failures, the system total 
overheads with the task completion time between these two 
schemes. The third group of experiments compares the total 
overheads of setting checkpoints, the total re-computing time, the 
total overheads of recovering from failures, the system total 
overheads with the numbers of the failure under the different 
checkpoint ratio u between these two schemes. The system total 
overheads refer to the sum of total overheads of setting 
checkpoints, total re-computing time and total overheads of 
recovering from failures. 

Our simulations are based on the 22 high-performance 
computing systems in LANL (Los Alamos National Labs) from 
February 23, 1997 to September 2, 2004, which is a period of 
round 3,958,008 minutes and has 514 failures. When the failure 
follows the exponential distribution, like [8], we also assume the 
rate parameter of permanent failure kp =10 ' and the rate 
parameter of transient failure /./ = 10 -6 . When the failure follows 
the Weibull distribution, we make the best fitted Weibull 
distribution to the Nodel's failure datum of System2 in LANL 
from February 23, 1997 to December 10, 2004, and obtain the 
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fitted shape parameter /? = 0.6857 and the scale parameter 
a = 2.0815. And the other parameters are shown in Table 2. 

2.1 Performance Comparisons under the Failure 
Numbers. Firstly, we respectively compare the total overheads 
of setting checkpoints, the total re-computing time, the total 
overheads of recovering from failures and the system total 
overheads with the numbers of the failure between the two-level 
increment checkpoint recovery scheme and the classical two-level 
checkpoint recovery scheme. The comparison results are shown in 
Fig. 5 and Fig. 6. 

From Fig. 5 and Fig. 6, we can see that for both the Weibull 
distribution and exponential distribution, the total overheads of 
setting checkpoints, the total re-computing time and the system 
total overheads in our two-level incremental checkpoint recovery 
scheme are all less than those in the two-level checkpoint recovery 
scheme. Only the total overheads of recovering from failures in 
our two-level incremental checkpoint recovery scheme is slightly 
larger than the two-level checkpoint recovery scheme, this is 
because when the transient failure occurs, the two-level checkpoint 
recovery scheme needs to read the z'-checkpoint after the last N- 
checkpoint or m-checkpoint, which increases the recovery 
overheads. However, the growth of total overheads of recovering 



from failures is negligible compared to the reduction of the other 
aspects. 

2.2 Performance Comparisons under the task completion 
time. Next, we respectively compare the total overheads of 
setting checkpoints, the total re-computing time, the total 
overheads of recovering from failures and the system total 
overheads with the task completion time between the two-level 
increment checkpoint recovery scheme and the classical two-level 
checkpoint recovery scheme. The comparison results are shown in 
Fig. 7 and Fig. 8. 

From Fig. 7 and Fig. 8, we can see that for both the Weibull 
distribution and exponential distribution, the total overheads of 
setting checkpoints, the total re-computing time and the system 
total overheads in our two-level incremental checkpoint recovery 
scheme are all less than the two-level checkpoint recovery scheme. 
Only the total overheads of recovering from failures in our two- 
level incremental checkpoint recovery scheme is slightly larger 
than the two-level checkpoint recovery scheme, this is also because 
when the transient failure occurs, the two-level checkpoint 
recovery scheme needs to read the z-checkpoint after the last N- 
checkpoint or m-checkpoint, which increases the recovery 
overheads. However, the growth of total overheads of recovering 
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from failures is negligible compared to the reduction of the other 
aspects. And the longer the task completion time is, the larger the 
advantage of our proposed scheme in reducing the system total 
overheads is, which shows that our proposed recovery scheme is 
more suitable for long-running application task and can obtain the 
lower system total overheads. 

2.3 Performance Comparisons under different 
checkpoint ratio. From the above analyses, we know that the 
proposed scheme reduces the overhead of the system total 
overheads, re-computing time and the overheads of the setting 
checkpoints through introducing the z-checkpoint with low setting 
overheads. Next, through comparing the system total overheads 
with the task completion time between two-level checkpoint 
recovery scheme and two-level checkpoint recovery scheme under 
the Weibull distribution and exponential distribution, we show 
how the checkpoint ratio influences the system total overheads, 
and then show how the i-checkpoint influences the system total 
overheads. The checkpoint ratio u = OJO m is the ratio of the 
overheads of setting i-checkpoint and m-checkpoint. 

From Fig. 9 and Fig. 10, we can see that when the value u is 
small, the system total overheads of two-level incremental 
checkpoint recovery scheme for both failure distribution types 
are smaller than those of the two-level checkpoint recovery 



scheme, for example, under the situation w<15% for the Weibull 
distribution and M<30% for exponential distribution. When the 
value u approaches some threshold, each checkpoint recovery 
scheme has its own advantages respectively, for example, when u 
approaches 15% for the Weibull distribution (the two curves 
coincide approximately) and 33% for exponential respectively. 
When the value u is larger than this threshold, the system total 
overheads of two-level incremental checkpoint recovery scheme 
for both failure distribution are larger than the two-level 
checkpoint recovery scheme, for example, when M>15% for the 
Weibull distribution and M>30% for exponential distribution. 
These conclusions are consistent with the results of the Fig. 4. This 
is because when the value u increases to a certain value, the 
overheads of setting i-checkpoints approach the overheads of 
setting m-checkpoint, which results in that the contents of i- 
checkpoint are approximately equal to the contents of m- 
checkpoint. So the z'-checkpoint loses the advantage of low setting 
overhead gradually, and therefore, the advantage of the two-level 
incremental checkpoint becomes less and less. 

In conclusion, when the value u is small, compared to the two- 
level checkpoint recovery scheme, the longer the time of long- 
running application is, the larger the advantage of our proposed 
scheme is, the larger the reduction of the system total overheads is, 



PLOS ONE | www.plosone.org 



14 



August 2014 | Volume 9 | Issue 8 | e104591 



Two-Level Incremental Checkpoint Recovery Scheme 



Two level Recovery Scheme 
Two-level Incremental Recovery Scheme 


































y 









































(a) Completion Time (Mins) 



u 

I 



Two level Recovery Scheme 

Iwo level liuiemen^il Reeovery Scheme 











































































(b) 

Completion Time (Mins) 



kv.O liVil Recovery ScketllC 

1 «(i k vel liu emc:r,il Hi-i every Sckeme 



Two-level Recovery Scheme 
- Two-leve] kieremer.kil recovery Srkcoie 



0 1 1 1 ' 1 1 0 1 1 1 1 1 1 1 

0123456 (1123456 

(c) Completion Time (Mins) (d) Completion Time (Mins) kio' 

Figure 10. For exponential distribution, comparisons of the system total overheads between two-level incremental checkpoint 
recovery scheme and two-level checkpoint recovery scheme, (a) u = 0/O m = 10%; (b) u = 0/O m = 30%; (c) u = 0/O m =33%; (d) u = 0/ 
O m = 40%. 

doi:1 0.1 371 /journal.pone.01 04591. g010 



and the better the performance of the proposed scheme is. The 
introduced z-checkpoint in our proposed scheme only save the 
application states that have changed since the previous checkpoint, 
while the m-checkpoint save the total states of the application. The 
stored contents of the changed states are much lower than the total 
states, so the overhead of setting an z-checkpoint is much lower 
than the m-checkpoint. Therefore, the checkpoint ratio u can be 
kept in a small value. So, our two-level incremental checkpoint 
recovery scheme has the better performance than the two-level 
checkpoint recovery scheme. 

Limitations of the study, open questions, and 
future work 

The checkpoint recovery technology has been considered as a 
promising technique to tolerate the system failures, guarantee the 
system reliability, and ensure the successful completion of the long- 
running tasks, and lots of checkpoint recovery schemes have been 
proposed recendy. In this paper, based on the two-level checkpoint 
recovery idea, a two-level incremental checkpoint recovery scheme 
is proposed to further reduce the system total overheads. Three 
types of checkpoints, say ./V-checkpoint, m-checkpoint and i- 
checkpoint, are used in our scheme. The .(V-checkpoint is used to 



deal with the less probable or infrequent failures, while the m- 
checkpoint and z-checkpoint are used to deal with the more 
probable or frequent failures. Experiment results show that 
compared to the two-level checkpoint recovery scheme, the 
proposed scheme significantly reduces the transfers of the storing 
contents, the overheads of setting checkpoints and the re- 
computing time, and thereby reduces the system total overheads. 

Unfortunately, there are still limitations in our study. Like 
Vaidya's study on the two-level checkpoint recovery scheme [8], 
our contribution is also a theoretical idea. When Vaidya 
introduces his/her work, he/she just considered the ideal case 
and took the main performance factors into account without any 
practical application implemented. This does simplify the problem 
and pay attention to the main factors that affect the system 
performance basically [15,16,18]. Therefore, in our paper, we also 
adopt the same assumptions used in the works [8,15,16,18] and 
the performance analyses focus on these main factors. This enables 
us to compare our scheme with the existing ones in the same 
circumstance, but we all know that the system performance heavily 
depends on the characteristics of the applications being studied. In 
fact, during the practical running, the checkpoint recovery scheme 
must be affected by other factors, such as network throughout and 
I/O interaction [30,31]. Although some studies [8,15,16,18] just 
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considered the main factors and their experiments ignored the 
affection of those application-related factors, there have been some 
other researches that study their affection on the checkpoint 
recovery scheme [26,27], which can be considered as another new 
and independent research topic. Our work focuses on the idea of 
the two-level incremental checkpoint recovery, and studies of the 
affection of these application-related factors are not our contribu- 
tion of this paper. 

Based on our study, four main research questions remain open 
and unsolved. The first is to implement our scheme in some 
practical application and explore how those application-related 
factors, such as network throughout and 1/ O interaction, affect the 
system performance. The second is to find an effective checkpoint 
placement method because the placement in our scheme is clearly 
more difficult than that in traditional one-level or two-level 
scheme. The third is to consider how to improve our scheme in the 
special case that the local storage is not error-free. The last but the 
most enjoyable is to introduce our idea into the multi-level 
checkpointing system [19] to show if a good result can be 
obtained. Still, our future work shall firstly focus on the 
implementation of our scheme in a practical system and show 
how the application-related factors affect the system performance. 
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